09/22/2020

Reminders

  • Assignment 1
    • Due: 09/23/2020 11:55pm (EST)
  • Assignment 2 is available now
    • Due: 10/02/2020 11:55pm (EST)
  • Office hours
    • Monday 9 - 10am (EST)
    • Wednesday 12:30 - 1:30pm (EST)
  • Office hour notes
    • Available on NYU Classes under the “Resources” tab

Fun slide

“Get ten bagels when you arrive at the bakery. If you see muffins, get one.”

How many bagels will you buy?

  • A. 10

  • B. 1

Another fun slide

“Get ten bagels when you arrive at the bakery. If you see muffins, get one.”

has_Muffin <- TRUE
shopping_at <- function(location) {
  if (location == "bakery") {
    if (has_Muffin == TRUE) {
      bagel <- 1   # "If you see muffins, get one" -- so it buys one bagel, literally!
    } else {
      bagel <- 10
    }
  }
  return(bagel)
}
shopping_at(location = "bakery")
## [1] 1

One more fun slide

has_Muffin <- TRUE
shopping_at <- function (human = TRUE, location) {
  if (location == "bakery") {
    if (human == TRUE) {
      if (has_Muffin == TRUE) {
        bagel <- 10
        muffin <- 1
      } else {
        bagel <- 10
      }
    } else {
      if (has_Muffin == TRUE) {
        bagel <- 1
      } else {
        bagel <- 10
      }
    }
  }
  return(bagel)
}
shopping_at(human = TRUE, location = "bakery")  # 10
shopping_at(human = FALSE, location = "bakery") # 1

How about another fun slide?

To make things right, try to talk like this:

When you arrive at the bakery:

  • If there are muffins, buy 1 muffin and 10 bagels. Checkout.

  • If there is no muffin, buy 10 bagels. Checkout.

So many fun slides

How we communicate:

  • "Get some bagels. Maybe we should get some muffins. I don’t know. If muffins look good, get some. Oh! Don’t forget brownies. If you buy all three, get less bagels cause that will be too much."

Human: totally fine! don’t worry I got it!

Computer:

  • error: “some” is not defined
  • logic error: “maybe”
  • logic error: “I don’t know, either.”
  • error: “good” is not defined
  • error: “less” is not defined
  • logic warning: “If you buy all three…”

Today’s topics

  • Math Review
    • Ordinary Least Squares Linear Regression
  • Programming in R
    • Simple regression analysis from scratch

Math Review

Linear Regression

There are many types of regression analysis.

In this course, we will focus on linear regression.

\[ Y_i = \beta_0 + \beta_1 \times X_i + \varepsilon_i \] where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_i\) is the error term.

We need some data

Let’s create 100 data points with two variables: X and Y, which are highly correlated.

\[ Y_i = X_i + \varepsilon_i \]

##          X        Y       Diff.
## 1 20.38222 21.37346  0.99123880
## 2 20.51409 20.29707 -0.21702244
## 3 17.89188 17.32735 -0.56452933
## 4 19.27714 19.31970  0.04256237
## 5 18.61547 19.46676  0.85129379
## 6 18.24215 18.33422  0.09207247
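
The table above could have been produced by a simulation along these lines (a sketch: the seed, mean, and noise SD are assumptions, not the original values):

```r
# Sketch of the simulation (seed, mean, and noise SD are assumed values)
set.seed(123)
n <- 100
X <- rnorm(n, mean = 19, sd = 1)        # predictor
Y <- X + rnorm(n, mean = 0, sd = 0.5)   # Y_i = X_i + noise
dat_sim <- data.frame(X = X, Y = Y, Diff. = Y - X)
head(dat_sim)
```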

Let’s visualize the simulated data set

library(ggplot2)  # for ggplot()
library(plotly)   # for ggplotly()

p <- ggplot(dat_sim, aes(x = X, y = Y)) +
  geom_point() +
  theme_minimal()
ggplotly(p)
paste("Correlation Coefficient: ", 
      round(cor(x = dat_sim$X, y = dat_sim$Y), digits = 3),
      sep = "")
## [1] "Correlation Coefficient: 0.9"

We need a model

Let’s fit a linear regression model. A linear regression defines a line:

\[ \hat{Y} = \beta_0 + \beta_1 X_i \]

What does this mean?

The regression line is the line that minimizes the sum of squared distances/errors between \(Y_i\) and the line. In this case:

\[ (\hat{\beta_0}, \hat{\beta_1}) = \underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^{100} (Y_i - \hat {Y_i})^2 = \underset{\beta_0, \beta_1}{\arg\min} \sum_{i=1}^{100} (Y_i - (\beta_0 + \beta_1 X_i))^2 \]

We love math!

\[ \begin{align} \mathcal{Q}(\beta_0, \beta_1) & = \sum_{i=1}^{n} (Y_i - \hat {Y_i})^2 = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2 \\ & = \sum_{i=1}^n (Y_i - \beta_0 -\beta_1 X_i)^2 \\ & = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i \\ \end{align} \]

We love math so much!

\[ \mathcal{Q}(\beta_0, \beta_1) = \sum_{i=1}^n Y_{i}^2 + n \beta_0^2 + \beta_1^2 \sum_{i=1}^n X_i^2 - 2 \beta_0 \sum_{i=1}^n Y_i - 2 \beta_1 \sum_{i=1}^n Y_i X_i + 2 \beta_0 \beta_1 \sum_{i=1}^n X_i \]

\[ \frac{\partial Q}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} \]

\[ \frac{\partial Q}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i \]

where \(\overline{Y}\) is the mean of \(Y\), and \(\overline{X}\) is the mean of \(X\).

We love math! (Yes we do!)

\(\frac{\partial \mathcal{Q}}{\partial \beta_0}\) is the rate at which \(\mathcal{Q}\) changes when the intercept (\(\beta_0\)) is varied and the slope (\(\beta_1\)) is held constant.

\(\frac{\partial \mathcal{Q}}{\partial \beta_1}\) is the rate at which \(\mathcal{Q}\) changes when the slope (\(\beta_1\)) is varied and the intercept (\(\beta_0\)) is held constant.

Almost there …

To get the Ordinary Least Squares (OLS) estimates, we need to confirm that the critical point is a minimum rather than a maximum or a saddle point. This means checking that the second derivatives are positive.

\[ \frac{\partial \mathcal{Q}}{\partial \beta_0} = 2n \beta_0 - 2n \overline{Y} + 2n \beta_1 \overline{X} \]

\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_0^2} = 2n \]

\[ \frac{\partial \mathcal{Q}}{\partial \beta_1} = 2 \beta_1 \sum_{i=1}^n X_i^2 + 2n \beta_0 \overline{X} - 2 \sum_{i=1}^n Y_i X_i \]

\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_1^2} = 2 \sum_{i=1}^n X_i^2 \]

\[ \frac{\partial^2 \mathcal{Q}}{\partial \beta_0 \partial \beta_1} = 2 \sum_{i=1}^n X_i = 2n \overline{X} \]
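
The three second derivatives can be collected into a Hessian matrix; the critical point is a minimum because the Hessian is positive definite (provided the \(X_i\) are not all equal):

\[ H = \begin{bmatrix} 2n & 2n \overline{X} \\ 2n \overline{X} & 2 \sum_{i=1}^n X_i^2 \end{bmatrix}, \qquad \det(H) = 4n \sum_{i=1}^n X_i^2 - 4n^2 \overline{X}^2 = 4n \sum_{i=1}^n (X_i - \overline{X})^2 > 0 \]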

Final math slide I promise

Then, set \(\frac{\partial \mathcal{Q}}{\partial \beta_0} = 0\) and \(\frac{\partial \mathcal{Q}}{\partial \beta_1} = 0\) and solve. We get:

\[ \frac{\partial Q}{\partial \beta_0} = 2n\beta_0 - 2n\overline{Y} + 2n\beta_1 \overline{X} \]

\[ \frac{\partial Q}{\partial \beta_1} = 2\beta_1 \sum_{i=1}^n X_i^2 + 2n\beta_0 \overline{X} - 2\sum_{i=1}^n Y_i X_i \]

\[ \beta_0 = \overline{Y} - \beta_1 \overline{X} \]

\[ \beta_1 = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})} {\sum_{i=1}^n (X_i - \overline{X})^2} = \frac{Cov.(X, Y)}{Var.(X)} = \frac{Corr.(X, Y) \times S.D.(Y)}{S.D.(X)} \] The higher the correlation, the steeper the slope.
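
These closed-form solutions can be checked in R from scratch (a sketch with simulated data; the variable names are illustrative):

```r
# OLS estimates computed from the closed-form solutions,
# checked against R's built-in lm() (simulated data)
set.seed(1)
x <- rnorm(100, mean = 19, sd = 1)
y <- x + rnorm(100, mean = 0, sd = 0.5)

beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
beta_0 <- mean(y) - beta_1 * mean(x)                                 # intercept

fit <- lm(y ~ x)
c(beta_0, beta_1)
coef(fit)  # should match
```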

Correlation and slope

Programming in R

Getting the data set

The Data Set

Number of rows: 19,961

Variables

  • bib: number on the runner’s bib (ID)

  • name: initials

  • sex: biological sex

  • age

  • city: home city

  • net_sec: time record in seconds (accounting for staggered starting time)

  • clock_sec: time record in seconds (based on the clock)

  • pace_sec: average time per mile, in seconds

  • event: either the “10 Mile” race or the “5K”

Import the data set to RStudio

# Set working directory

# Load the data set
dat <- read.csv("../../../data/run17.csv")

Check dimensions

Dimensions: number of rows, number of columns

nrow(dat)
## [1] 19961

Descriptive Analysis

Hypotheses

\(H_0\): For the 10 Mile race, the mean net finish time of male runners is the same as that of female runners.

\(H_1\): For the 10 Mile race, the mean net finish time of male runners is not the same as that of female runners.

Subset the data set

dat_10m <- dat[dat$event == "10 Mile", ]
dat_10m$sex <- ifelse(dat_10m$sex == "M", 1, 0)
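
The `ifelse()` recode maps "M" to 1 and everything else to 0. A toy check (not the race data):

```r
# Toy check of the ifelse() recode: "M" -> 1, otherwise 0
sex <- c("M", "F", "F", "M")
sex_num <- ifelse(sex == "M", 1, 0)
sex_num
## [1] 1 0 0 1
```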

Levene’s test

To examine whether the two groups have equal variances, i.e., whether a pooled-variance t-test is appropriate.

library(car)  # provides leveneTest()
leveneTest(net_sec ~ factor(sex), data = dat_10m)
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     1  34.267 4.891e-09 ***
##       17440                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

T-test

Levene’s test rejects equal variances, so we use Welch’s t-test (var.equal = FALSE):

t.test(net_sec ~ sex, data = dat_10m, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  net_sec by sex
## t = 40.434, df = 14287, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  582.1310 641.4466
## sample estimates:
## mean in group 0 mean in group 1 
##        6113.618        5501.829

Linear regression

summary(lm(net_sec ~ sex, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ sex, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2896.62  -687.62   -38.62   640.33  2896.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6113.618      9.454  646.69   <2e-16 ***
## sex         -611.789     14.927  -40.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 966.2 on 17440 degrees of freedom
## Multiple R-squared:  0.08786,    Adjusted R-squared:  0.0878 
## F-statistic:  1680 on 1 and 17440 DF,  p-value: < 2.2e-16
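
Note that the sex coefficient (-611.789) is exactly the difference in group means from the t-test slide (5501.829 - 6113.618). With a 0/1 predictor this holds in general; a toy demo with simulated data (not the race data):

```r
# With a binary 0/1 predictor, the regression slope equals the
# difference in group means (toy simulated data)
set.seed(42)
sex <- rbinom(200, size = 1, prob = 0.5)
net_sec <- 6100 - 600 * sex + rnorm(200, sd = 900)

fit <- lm(net_sec ~ sex)
mean_diff <- mean(net_sec[sex == 1]) - mean(net_sec[sex == 0])
unname(coef(fit)["sex"])  # equals mean_diff
```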

Visualize regression results

plot(x = dat_10m$sex, y = dat_10m$net_sec, type = "p")
abline(a = 6113.618, b = -611.789, lwd = 2)

Let’s do another regression analysis

On average, the older, the slower.

summary(lm2 <- lm(net_sec ~ age, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5388.0324    27.2989  197.37   <2e-16 ***
## age           12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886,    Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16

Visualize regression results

plot(x = dat_10m$age, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5388.0324, b = 12.9782, lwd = 2)

De-mean

dat_10m$age0 <- dat_10m$age - mean(dat_10m$age)
summary(lm3 <- lm(net_sec ~ age0, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age0, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.2292     7.5877  773.38   <2e-16 ***
## age0          12.9782     0.7087   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886,    Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16

Visualize regression results

plot(x = dat_10m$age0, y = dat_10m$net_sec, type = "p", col = "grey")
abline(a = 5868.2292, b = 12.9782, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")

Standardize

dat_10m$age1 <- dat_10m$age0 / sd(dat_10m$age0)
summary(lm4 <- lm(net_sec ~ age1, data = dat_10m))
## 
## Call:
## lm(formula = net_sec ~ age1, data = dat_10m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2969.38  -700.70   -20.42   672.78  2670.51 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5868.229      7.588  773.38   <2e-16 ***
## age1         138.950      7.588   18.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1002 on 17440 degrees of freedom
## Multiple R-squared:  0.01886,    Adjusted R-squared:  0.01881 
## F-statistic: 335.3 on 1 and 17440 DF,  p-value: < 2.2e-16

Visualize regression results

plot(x = dat_10m$age1, y = dat_10m$net_sec, type = "p", col = "grey", xlab = "SD")
abline(a = 5868.229, b = 138.950, lwd = 2)
abline(v = 0, lty = "dashed", col = "red")
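
The patterns above are general: de-meaning the predictor moves the intercept to the mean of the outcome without changing the slope, and standardizing multiplies the slope by the SD of the predictor. A toy demo with simulated data (not the race data; the parameter values are illustrative):

```r
# De-meaning: intercept becomes mean(y); slope unchanged.
# Standardizing: slope becomes raw slope * sd(x).
set.seed(7)
age <- rnorm(300, mean = 38, sd = 10.7)
net_sec <- 5400 + 13 * age + rnorm(300, sd = 1000)

fit_raw <- lm(net_sec ~ age)
age0 <- age - mean(age)          # de-meaned predictor
fit_dm  <- lm(net_sec ~ age0)
age1 <- age0 / sd(age0)          # standardized predictor
fit_std <- lm(net_sec ~ age1)

coef(fit_dm)["(Intercept)"]  # equals mean(net_sec)
coef(fit_std)["age1"]        # equals coef(fit_raw)["age"] * sd(age)
```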